
Update worldstate state change behavior in case of FCU #5699

Merged · 15 commits · Aug 18, 2023

Conversation

@matkt (Contributor) commented Jul 13, 2023

PR description

We have identified a potential issue with the FCU mechanism. Based on the logs, Besu considers each new head a reorganization (reorg) because of this line https://github.com/hyperledger/besu/blob/main/consensus/merge/src/main/java/org/hyperledger/besu/consensus/merge/blockcreation/MergeCoordinator.java#L610, so we do not update the worldstate when the new head is not a direct descendant of the current head, which is rare in general. As a result, we enter the "rewindToBlock" method https://github.com/hyperledger/besu/blob/main/consensus/merge/src/main/java/org/hyperledger/besu/consensus/merge/blockcreation/MergeCoordinator.java#L633, which performs a chain reorg (in this case, a forward move of the chain) without changing the worldstate. With Prysm, however, we only ever encounter this case, so the chain keeps advancing while the worldstate stands still. Consequently, each "engineNewPayload" call has to apply a growing number of trie logs to process the new block, which eventually crashes the node.

Upon examining the code, it is clearly abnormal to have a scenario where we change the blockchain without changing the worldstate.
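
For illustration, a minimal sketch of the invariant this PR restores (hypothetical names, not the literal diff): whenever the chain head moves, the worldstate must move with it, including on non-descendant head updates.

```java
// Illustrative sketch only (hypothetical names, not the Besu API): the chain
// head and the worldstate must always point at the same block, even when the
// new head is not a direct descendant of the old one.
final class HeadUpdateSketch {

  record BlockHeader(String hash, long number) {}

  interface Chain {
    BlockHeader headHeader();

    boolean rewindToBlock(String blockHash); // moves the chain only
  }

  interface WorldState {
    // Applies trie logs backward/forward until the state matches newHead.
    boolean moveTo(BlockHeader newHead);
  }

  private final Chain chain;
  private final WorldState worldState;

  HeadUpdateSketch(final Chain chain, final WorldState worldState) {
    this.chain = chain;
    this.worldState = worldState;
  }

  boolean setNewHead(final BlockHeader newHead) {
    if (newHead.hash().equals(chain.headHeader().hash())) {
      return true; // already at the requested head
    }
    // Before the fix, only the chain moved on this path; the worldstate stayed
    // behind and every engineNewPayload had to replay more and more trie logs.
    return chain.rewindToBlock(newHead.hash()) && worldState.moveTo(newHead);
  }
}
```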

Fixed Issue(s)

Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
@github-actions

  • I thought about documentation and added the doc-change-required label to this PR if updates are required.
  • I thought about the changelog and included a changelog update if required.
  • If my PR includes database changes (e.g. KeyValueSegmentIdentifier) I have thought about compatibility and performed forwards and backwards compatibility tests.

Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
@ahamlat (Contributor) commented Jul 31, 2023

I conducted a test on mainnet using this PR and version 23.4.4. During the experiment, I halted Prysm on both instances for two hours. The node running this PR recovered successfully, whereas the node running 23.4.4 suffered a continuous decline in performance due to the missing worldstate updates and eventually crashed.


I also tested different consensus clients in different scenarios:

  • Initial Sync (performs backwards sync as last stage)
  • Scenario 1: CL restart
  • Scenario 2: Besu restart
  • Scenario 3: CL short downtime: 60 mins
  • Scenario 4: CL long downtime: more than 24 hours
  • Scenario 5: Besu short downtime: 60 mins
  • Scenario 6: Besu longer downtime: more than 24 hours
| Scenario/CL | Nimbus | Teku | Prysm | Lighthouse | Lodestar |
|---|---|---|---|---|---|
| Initial Sync | | | | | |
| Scenario 1 | | | | | |
| Scenario 2 | ⚠️ * | | | | |
| Scenario 3 | ✅ Backward sync | | | | |
| Scenario 4 | ✅ Backward sync | | | | |
| Scenario 5 | ✅ Backward sync | | | | |
| Scenario 6 | ✅ Backward sync | ✅ Backward sync | ✅ Backward sync | | |
  • ⚠️ *: When restarting Besu with Nimbus as CL, Besu receives newPayload calls for blocks at chain head + 2. This triggers a backward sync for each new payload because the parent of the block doesn't exist in the blockchain. Eventually, Besu recovers from this loop of one-block backward syncs (see logs below).

Also, setting a finalized block to the last finalized block on newPayload calls is a bit confusing.

{"@timestamp":"2023-08-01T11:34:00,218","level":"INFO","thread":"vert.x-worker-thread-0","class":"MergeCoordinator","message":"maybeHeadHeader is not present","throwable":""}
{"@timestamp":"2023-08-01T11:34:00,218","level":"INFO","thread":"vert.x-worker-thread-0","class":"MergeCoordinator","message":"chain head = 4005502 ","throwable":""}
{"@timestamp":"2023-08-01T11:34:00,218","level":"DEBUG","thread":"vert.x-worker-thread-0","class":"MergeCoordinator","message":"Appending new head block hash 0xd7beb71befb73a0cb426d2d00b977d7ae7355fb4402ecf8522fccd8d79671e53 to backward sync","throwable":""}
{"@timestamp":"2023-08-01T11:34:00,218","level":"INFO","thread":"vert.x-worker-thread-0","class":"BackwardSyncContext","message":"Starting a new backward sync session","throwable":""}
{"@timestamp":"2023-08-01T11:34:00,244","level":"INFO","thread":"vert.x-worker-thread-0","class":"MergeCoordinator","message":"maybeHeadHeader is not present","throwable":""}
{"@timestamp":"2023-08-01T11:34:00,244","level":"INFO","thread":"vert.x-worker-thread-0","class":"MergeCoordinator","message":"chain head = 4005502 ","throwable":""}
{"@timestamp":"2023-08-01T11:34:00,244","level":"DEBUG","thread":"vert.x-worker-thread-0","class":"MergeCoordinator","message":"Appending new head block hash 0x507d68e4e715e1aa1a761fa2a2af71c4cd8807793c521f8ac99618675efe15b1 to backward sync","throwable":""}
{"@timestamp":"2023-08-01T11:34:01,423","level":"INFO","thread":"nioEventLoopGroup-3-2","class":"BackwardSyncContext","message":"Backward sync phase 2 of 2 completed, imported a total of 1 blocks. Peers: 3","throwable":""}
{"@timestamp":"2023-08-01T11:34:01,424","level":"INFO","thread":"nioEventLoopGroup-3-2","class":"BackwardSyncAlgorithm","message":"Current backward sync session is done","throwable":""}
{"@timestamp":"2023-08-01T11:34:01,503","level":"DEBUG","thread":"nioEventLoopGroup-3-2","class":"MergeCoordinator","message":"Setting finalized block header to 4005417 (0x341c270f8d1012eecac0572746e04fc616a1fbf183ac3c46f9b80d7403343579)","throwable":""}
{"@timestamp":"2023-08-01T11:34:01,503","level":"DEBUG","thread":"nioEventLoopGroup-3-2","class":"MergeCoordinator","message":"Setting finalized block header to 4005444 (0x6c7ce692d6ef60076df9bbc33c2e41bfae7551028c599cfab7d0fa89b5ffc61f)","throwable":""}
{"@timestamp":"2023-08-01T11:34:12,103","level":"INFO","thread":"vert.x-worker-thread-0","class":"MergeCoordinator","message":"maybeHeadHeader is not present","throwable":""}
{"@timestamp":"2023-08-01T11:34:12,103","level":"INFO","thread":"vert.x-worker-thread-0","class":"MergeCoordinator","message":"chain head = 4005503 ","throwable":""}
{"@timestamp":"2023-08-01T11:34:12,103","level":"DEBUG","thread":"vert.x-worker-thread-0","class":"MergeCoordinator","message":"Appending new head block hash 0x507d68e4e715e1aa1a761fa2a2af71c4cd8807793c521f8ac99618675efe15b1 to backward sync","throwable":""}
{"@timestamp":"2023-08-01T11:34:12,103","level":"INFO","thread":"vert.x-worker-thread-0","class":"BackwardSyncContext","message":"Starting a new backward sync session","throwable":""}
{"@timestamp":"2023-08-01T11:34:12,120","level":"INFO","thread":"vert.x-worker-thread-0","class":"MergeCoordinator","message":"maybeHeadHeader is not present","throwable":""}
{"@timestamp":"2023-08-01T11:34:12,121","level":"INFO","thread":"vert.x-worker-thread-0","class":"MergeCoordinator","message":"chain head = 4005503 ","throwable":""}
{"@timestamp":"2023-08-01T11:34:12,121","level":"DEBUG","thread":"vert.x-worker-thread-0","class":"MergeCoordinator","message":"Appending new head block hash 0x5bdf6acda6c7589286199ff50faa43f335ca339d4717c22fba508c6607505673 to backward sync","throwable":""}
{"@timestamp":"2023-08-01T11:34:12,591","level":"INFO","thread":"nioEventLoopGroup-3-2","class":"BackwardSyncContext","message":"Backward sync phase 2 of 2 completed, imported a total of 1 blocks. Peers: 3","throwable":""}
{"@timestamp":"2023-08-01T11:34:12,591","level":"INFO","thread":"nioEventLoopGroup-3-2","class":"BackwardSyncAlgorithm","message":"Current backward sync session is done","throwable":""}
{"@timestamp":"2023-08-01T11:34:12,700","level":"DEBUG","thread":"nioEventLoopGroup-3-2","class":"MergeCoordinator","message":"Setting finalized block header to 4005417 (0x341c270f8d1012eecac0572746e04fc616a1fbf183ac3c46f9b80d7403343579)","throwable":""}
{"@timestamp":"2023-08-01T11:34:12,700","level":"DEBUG","thread":"nioEventLoopGroup-3-2","class":"MergeCoordinator","message":"Setting finalized block header to 4005444 (0x6c7ce692d6ef60076df9bbc33c2e41bfae7551028c599cfab7d0fa89b5ffc61f)","throwable":""}

@matkt matkt changed the title update fcu Update worldstate state change behavior in case of FCU and EngineNewPayload Aug 1, 2023
@matkt matkt marked this pull request as ready for review August 1, 2023 08:35
Signed-off-by: ahamlat <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H <ameziane.hamlat@consensys.net>
@garyschulte (Contributor) left a comment

This looks good to me, but we should call out explicitly that we are using backward sync as an asynchronous way of moving the head. We would probably benefit from an explicit behavior for this, whether it is part of the backward sync service or not. As it is, when we encounter this condition, I believe we are going to unnecessarily download blocks we already have in blockchain storage.
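
As a hedged illustration of that optimization (all names invented for the sketch, not Besu's actual backward-sync API): consult local block storage before reaching for the network.

```java
// Hypothetical sketch of the suggested optimization, with invented names:
// when backward sync is used as an asynchronous "move head" mechanism, check
// local storage first so blocks we already hold are not downloaded again.
final class BackwardSyncSketch {

  interface BlockStore {
    java.util.Optional<byte[]> getBlockByHash(String hash);
  }

  interface Downloader {
    byte[] fetchRemotely(String hash); // network fetch, the expensive path
  }

  private final BlockStore store;
  private final Downloader downloader;

  BackwardSyncSketch(final BlockStore store, final Downloader downloader) {
    this.store = store;
    this.downloader = downloader;
  }

  byte[] obtainBlock(final String hash) {
    // Cheap local lookup first: during a head move we usually already have the
    // block in blockchain storage, and only the worldstate is lagging behind.
    return store.getBlockByHash(hash).orElseGet(() -> downloader.fetchRemotely(hash));
  }
}
```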

```diff
-    if (maybeHeadHeader.isPresent()) {
-      LOG.atDebug()
+    if (maybeHeadHeader.isPresent()
+        && Math.abs(maybeHeadHeader.get().getNumber() - chainHead) < 500) {
```
Contributor:

This is only necessary for bonsai; perhaps we can defer to the WorldStateArchive to provide this limit? Eventually an archive-friendly version of bonsai would not need it either.
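
A possible shape for that suggestion, sketched with an invented method name (not the actual WorldStateArchive interface): bonsai reports its retained trie-log depth as a limit, while forest, which keeps full history, reports none.

```java
import java.util.OptionalLong;

// Invented method name for illustration -- not the actual Besu interface.
interface WorldStateArchiveLimit {
  // How far the worldstate can be moved in one roll; empty means unbounded.
  OptionalLong maxMoveDistance();
}

final class BonsaiLimit implements WorldStateArchiveLimit {
  private final long maxLayersToLoad; // the configurable bonsai retention depth

  BonsaiLimit(final long maxLayersToLoad) {
    this.maxLayersToLoad = maxLayersToLoad;
  }

  @Override
  public OptionalLong maxMoveDistance() {
    return OptionalLong.of(maxLayersToLoad);
  }
}

final class ForestLimit implements WorldStateArchiveLimit {
  @Override
  public OptionalLong maxMoveDistance() {
    return OptionalLong.empty(); // forest keeps every state, so no limit applies
  }
}
```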

Contributor:

If we use the backward sync mechanism as an asynchronous worldstate move, we should probably make backward sync aware that we already have those blocks. That is an optimization, though, and we can do it in a subsequent PR.

Contributor (author):

I removed this modification and reduced the scope of this PR; it should be enough to fix the problem (testing now).

```diff
-    if (maybeHeadHeader.isPresent()) {
-      LOG.atDebug()
+    if (maybeHeadHeader.isPresent()
+        && Math.abs(maybeHeadHeader.get().getNumber() - chainHead) < 500) {
```
Contributor:

Can we replace the number here with a named constant that describes where it comes from?

Contributor (author):

Good catch. Indeed, I used this number for the tests, but we should definitely use the max layers value that we have for bonsai.

Contributor (author):

public static final long RETAINED_LAYERS = 512; // at least 256 + typical rollbacks

Contributor:

Isn't that configurable, though?

Contributor (author):

Yes, there is a flag to change this value.
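
Sketching how the check could consume that configured value instead of a hard-coded literal (invented wiring; the real flag and plumbing live in Besu's bonsai configuration):

```java
// Sketch (invented names): instead of a literal 500, the move check uses the
// configured bonsai retention depth so the two values cannot drift apart.
final class MoveDistanceCheck {
  private final long maxLayersToLoad; // taken from configuration, not hard-coded

  MoveDistanceCheck(final long maxLayersToLoad) {
    this.maxLayersToLoad = maxLayersToLoad;
  }

  boolean canMoveWorldState(final long headNumber, final long targetNumber) {
    // Within the retained trie-log window we can roll; beyond it we cannot.
    return Math.abs(targetNumber - headNumber) < maxLayersToLoad;
  }
}
```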

Contributor:

If we defer to a WorldStateArchive method that provides this limit, we can skip the behavior for forest, since it should not have a worldstate move issue.

Contributor (author):

I removed this part. I will just keep the worldstate head update modification https://github.com/hyperledger/besu/pull/5699/files#diff-6da2d65dd7a0564527e8ce3ad7c4482f599fb2af30e37316c47854e5aed2e977R610. I'm running a new test now with this latest version.

ahamlat and others added 6 commits August 4, 2023 11:11
Signed-off-by: Ameziane H <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H <ameziane.hamlat@consensys.net>
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
@matkt (Contributor, author) commented Aug 16, 2023

I conducted the same test as @ahamlat on mainnet using this PR (after my latest modification):

  • Initial Sync (performs backwards sync as last stage)
  • Scenario 1: CL restart
  • Scenario 2: Besu restart
  • Scenario 3: CL short downtime: 60 mins
  • Scenario 4: CL long downtime: more than 24 hours
  • Scenario 5: Besu short downtime: 60 mins
  • Scenario 6: Besu longer downtime: more than 24 hours
| Scenario/CL | Nimbus | Teku | Prysm | Lighthouse | Lodestar |
|---|---|---|---|---|---|
| Initial Sync | | | | | |
| Scenario 1 | | | | | |
| Scenario 2 | | | | | |
| Scenario 3 | | | | | |
| Scenario 4 | | | | | |
| Scenario 5 | | | | | |
| Scenario 6 | | | | | |

@matkt matkt changed the title Update worldstate state change behavior in case of FCU and EngineNewPayload Update worldstate state change behavior in case of FCU Aug 17, 2023
@garyschulte (Contributor) left a comment

This makes much more sense than the prior behavior. I don't think we should have been doing a chain reorg without moving the worldstate.

🚢

```diff
   }

-  private boolean forwardWorldStateTo(final BlockHeader newHead) {
+  private boolean moveWorldStateTo(final BlockHeader newHead) {
```
@gfukushima (Contributor) commented Aug 18, 2023

This rename implies that we could use this method to move the worldstate "backwards" as well. Is that correct?

Contributor (author):

Yes, we can do both rollback and rollforward with this code.
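
As an illustration of what "move" covers, here is a hypothetical sketch (invented names, not the actual MergeCoordinator code) of rolling bonsai-style trie logs in either direction:

```java
// Hypothetical shape of a bidirectional worldstate move, illustrating why the
// rename fits: "move" covers rollback (newHead behind the current state block)
// as well as rollforward.
final class WorldStateMoveSketch {

  interface TrieLog {
    void rollForward(); // apply this block's state diff

    void rollBackward(); // undo this block's state diff
  }

  interface TrieLogProvider {
    // Trie logs for each block on the path from the current state block to
    // newHead, ordered in the direction of travel.
    java.util.List<TrieLog> pathTo(long currentStateBlock, long newHeadBlock);
  }

  private final TrieLogProvider provider;

  WorldStateMoveSketch(final TrieLogProvider provider) {
    this.provider = provider;
  }

  void moveWorldStateTo(final long currentStateBlock, final long newHeadBlock) {
    final boolean forward = newHeadBlock >= currentStateBlock;
    for (final TrieLog log : provider.pathTo(currentStateBlock, newHeadBlock)) {
      if (forward) {
        log.rollForward();
      } else {
        log.rollBackward();
      }
    }
  }
}
```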

@matkt matkt merged commit 9706bd7 into hyperledger:main Aug 18, 2023
8 checks passed
garyschulte pushed a commit to garyschulte/besu that referenced this pull request Aug 28, 2023

Update the worldstate in the same way as the blockchain in order to avoid having inconsistencies between the two and later trigger a big rollforward

Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
Co-authored-by: Ameziane H <ameziane.hamlat@consensys.net>
Signed-off-by: garyschulte <garyschulte@gmail.com>
eum602 pushed a commit to lacchain/besu that referenced this pull request Nov 3, 2023

Update the worldstate in the same way as the blockchain in order to avoid having inconsistencies between the two and later trigger a big rollforward

Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
Co-authored-by: Ameziane H <ameziane.hamlat@consensys.net>
Labels: mainnet · TeamChupa (GH issues worked on by Chupacabara Team)